Prosper Loan Data Exploration¶

By Adeniran Ridwan¶

Introduction¶

This is an exploratory data visualization project on the Prosper Loan datasets. This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others.

Preliminary Wrangling¶

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px

import warnings
warnings.simplefilter("ignore")
In [2]:
#importing the Medical_appointment dataset
data = pd.read_csv(r'C:\Users\Master Ridwan\Desktop\my data\prosperLoanData.csv', header = 0)
In [3]:
data.head(5)
Out[3]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
0 1021339766868145413AB3B 193129 2007-08-26 19:09:29.263000000 C 36 Completed 2009-08-14 00:00:00 0.16516 0.1580 0.1380 ... -133.18 0.0 0.0 0.0 0.0 1.0 0 0 0.0 258
1 10273602499503308B223C1 1209647 2014-02-27 08:28:07.900000000 NaN 36 Current NaN 0.12016 0.0920 0.0820 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
2 0EE9337825851032864889A 81716 2007-01-05 15:00:47.090000000 HR 36 Completed 2009-12-17 00:00:00 0.28269 0.2750 0.2400 ... -24.20 0.0 0.0 0.0 0.0 1.0 0 0 0.0 41
3 0EF5356002482715299901A 658116 2012-10-22 11:02:35.010000000 NaN 36 Current NaN 0.12528 0.0974 0.0874 ... -108.01 0.0 0.0 0.0 0.0 1.0 0 0 0.0 158
4 0F023589499656230C5E3E2 909464 2013-09-14 18:38:39.097000000 NaN 36 Current NaN 0.24614 0.2085 0.1985 ... -60.27 0.0 0.0 0.0 0.0 1.0 0 0 0.0 20

5 rows × 81 columns

In [4]:
#check for the structure of the dataset
data.shape
Out[4]:
(113937, 81)
In [5]:
#Check for more imformation about the dataset(Non-Null Count, Dtype )
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 81 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   ListingKey                           113937 non-null  object 
 1   ListingNumber                        113937 non-null  int64  
 2   ListingCreationDate                  113937 non-null  object 
 3   CreditGrade                          28953 non-null   object 
 4   Term                                 113937 non-null  int64  
 5   LoanStatus                           113937 non-null  object 
 6   ClosedDate                           55089 non-null   object 
 7   BorrowerAPR                          113912 non-null  float64
 8   BorrowerRate                         113937 non-null  float64
 9   LenderYield                          113937 non-null  float64
 10  EstimatedEffectiveYield              84853 non-null   float64
 11  EstimatedLoss                        84853 non-null   float64
 12  EstimatedReturn                      84853 non-null   float64
 13  ProsperRating (numeric)              84853 non-null   float64
 14  ProsperRating (Alpha)                84853 non-null   object 
 15  ProsperScore                         84853 non-null   float64
 16  ListingCategory (numeric)            113937 non-null  int64  
 17  BorrowerState                        108422 non-null  object 
 18  Occupation                           110349 non-null  object 
 19  EmploymentStatus                     111682 non-null  object 
 20  EmploymentStatusDuration             106312 non-null  float64
 21  IsBorrowerHomeowner                  113937 non-null  bool   
 22  CurrentlyInGroup                     113937 non-null  bool   
 23  GroupKey                             13341 non-null   object 
 24  DateCreditPulled                     113937 non-null  object 
 25  CreditScoreRangeLower                113346 non-null  float64
 26  CreditScoreRangeUpper                113346 non-null  float64
 27  FirstRecordedCreditLine              113240 non-null  object 
 28  CurrentCreditLines                   106333 non-null  float64
 29  OpenCreditLines                      106333 non-null  float64
 30  TotalCreditLinespast7years           113240 non-null  float64
 31  OpenRevolvingAccounts                113937 non-null  int64  
 32  OpenRevolvingMonthlyPayment          113937 non-null  float64
 33  InquiriesLast6Months                 113240 non-null  float64
 34  TotalInquiries                       112778 non-null  float64
 35  CurrentDelinquencies                 113240 non-null  float64
 36  AmountDelinquent                     106315 non-null  float64
 37  DelinquenciesLast7Years              112947 non-null  float64
 38  PublicRecordsLast10Years             113240 non-null  float64
 39  PublicRecordsLast12Months            106333 non-null  float64
 40  RevolvingCreditBalance               106333 non-null  float64
 41  BankcardUtilization                  106333 non-null  float64
 42  AvailableBankcardCredit              106393 non-null  float64
 43  TotalTrades                          106393 non-null  float64
 44  TradesNeverDelinquent (percentage)   106393 non-null  float64
 45  TradesOpenedLast6Months              106393 non-null  float64
 46  DebtToIncomeRatio                    105383 non-null  float64
 47  IncomeRange                          113937 non-null  object 
 48  IncomeVerifiable                     113937 non-null  bool   
 49  StatedMonthlyIncome                  113937 non-null  float64
 50  LoanKey                              113937 non-null  object 
 51  TotalProsperLoans                    22085 non-null   float64
 52  TotalProsperPaymentsBilled           22085 non-null   float64
 53  OnTimeProsperPayments                22085 non-null   float64
 54  ProsperPaymentsLessThanOneMonthLate  22085 non-null   float64
 55  ProsperPaymentsOneMonthPlusLate      22085 non-null   float64
 56  ProsperPrincipalBorrowed             22085 non-null   float64
 57  ProsperPrincipalOutstanding          22085 non-null   float64
 58  ScorexChangeAtTimeOfListing          18928 non-null   float64
 59  LoanCurrentDaysDelinquent            113937 non-null  int64  
 60  LoanFirstDefaultedCycleNumber        16952 non-null   float64
 61  LoanMonthsSinceOrigination           113937 non-null  int64  
 62  LoanNumber                           113937 non-null  int64  
 63  LoanOriginalAmount                   113937 non-null  int64  
 64  LoanOriginationDate                  113937 non-null  object 
 65  LoanOriginationQuarter               113937 non-null  object 
 66  MemberKey                            113937 non-null  object 
 67  MonthlyLoanPayment                   113937 non-null  float64
 68  LP_CustomerPayments                  113937 non-null  float64
 69  LP_CustomerPrincipalPayments         113937 non-null  float64
 70  LP_InterestandFees                   113937 non-null  float64
 71  LP_ServiceFees                       113937 non-null  float64
 72  LP_CollectionFees                    113937 non-null  float64
 73  LP_GrossPrincipalLoss                113937 non-null  float64
 74  LP_NetPrincipalLoss                  113937 non-null  float64
 75  LP_NonPrincipalRecoverypayments      113937 non-null  float64
 76  PercentFunded                        113937 non-null  float64
 77  Recommendations                      113937 non-null  int64  
 78  InvestmentFromFriendsCount           113937 non-null  int64  
 79  InvestmentFromFriendsAmount          113937 non-null  float64
 80  Investors                            113937 non-null  int64  
dtypes: bool(3), float64(50), int64(11), object(17)
memory usage: 68.1+ MB

What is the structure of your dataset?¶

The dataset consists of 113937 rows of loan records and 81 features.

What are the main features of interest in the dataset?¶

The main features of interest are the attributes that surrounds the borrowers in relation to the loan

What features in the dataset do you think will help support your investigation into your feature(s) of interest?¶

  • ListingKey: Unique key for each listing, same value as the 'key' used in the listing object in the API.
  • ListingCreationDate: The date the listing was created.
  • MemberKey: The unique key that is associated with the borrower. This is the same identifier that is used in the API member object.
  • EmploymentStatus:The employment status of the borrower at the time they posted the listing.
  • Occupation:The Occupation selected by the Borrower at the time they created the listing.
  • IncomeVerifiable: The borrower indicated they have the required documentation to support their income.
  • StatedMonthlyIncome: The monthly income the borrower stated at the time the listing was created.
  • IncomeRange: The income range of the borrower at the time the listing was created.
  • BorrowerAPR:The Borrower's Annual Percentage Rate (APR) for the loan.
  • BorrowerRate:The Borrower's interest rate for this loan.
  • BorrowerState: The two letter abbreviation of the state of the address of the borrower at the time the Listing was created.
  • DebtToIncomeRatio: The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%).
  • IsBorrowerHomeowner: A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.
  • CreditScoreRangeLower: The lower value representing the range of the borrower's credit score as provided by a consumer credit rating agency.
  • CreditScoreRangeUpper: The upper value representing the range of the borrower's credit score as provided by a consumer credit rating agency.
  • LoanStatus: The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress,
  • LoanKey: Unique key for each loan. This is the same key that is used in the API.
  • LoanOriginalAmount: The origination amount of the loan.
  • Term: The length of the loan expressed in months.
  • LoanOriginationDate: The date the loan was originated.
  • LoanOriginationQuarter: The quarter in which the loan was originated.
  • ClosedDate: Closed date is applicable for Cancelled, Completed, Chargedoff and Defaulted loan statuses.
  • TotalProsperLoans: Number of Prosper loans the borrower at the time they created this listing. This value will be null if the borrower had no prior loans.

Univariate Exploration¶

In [6]:
#lets copy the original data
loan_df = data.copy()
In [7]:
#selecting features of interest
cols =['ListingKey', 'ListingCreationDate', 'MemberKey','EmploymentStatus', 'Occupation','IncomeVerifiable','StatedMonthlyIncome', 'IncomeRange','BorrowerAPR','BorrowerRate','BorrowerState','DebtToIncomeRatio','IsBorrowerHomeowner','CreditScoreRangeLower','CreditScoreRangeUpper','LoanKey','LoanOriginalAmount','LoanOriginationDate','LoanOriginationQuarter','Term','ClosedDate', 'TotalProsperLoans']
data_df = loan_df[cols]
data_df.head()
Out[7]:
ListingKey ListingCreationDate MemberKey EmploymentStatus Occupation IncomeVerifiable StatedMonthlyIncome IncomeRange BorrowerAPR BorrowerRate ... IsBorrowerHomeowner CreditScoreRangeLower CreditScoreRangeUpper LoanKey LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter Term ClosedDate TotalProsperLoans
0 1021339766868145413AB3B 2007-08-26 19:09:29.263000000 1F3E3376408759268057EDA Self-employed Other True 3083.333333 $25,000-49,999 0.16516 0.1580 ... True 640.0 659.0 E33A3400205839220442E84 9425 2007-09-12 00:00:00 Q3 2007 36 2009-08-14 00:00:00 NaN
1 10273602499503308B223C1 2014-02-27 08:28:07.900000000 1D13370546739025387B2F4 Employed Professional True 6125.000000 $50,000-74,999 0.12016 0.0920 ... False 680.0 699.0 9E3B37071505919926B1D82 10000 2014-03-03 00:00:00 Q1 2014 36 NaN NaN
2 0EE9337825851032864889A 2007-01-05 15:00:47.090000000 5F7033715035555618FA612 Not available Other True 2083.333333 Not displayed 0.28269 0.2750 ... False 480.0 499.0 6954337960046817851BCB2 3001 2007-01-17 00:00:00 Q1 2007 36 2009-12-17 00:00:00 NaN
3 0EF5356002482715299901A 2012-10-22 11:02:35.010000000 9ADE356069835475068C6D2 Employed Skilled Labor True 2875.000000 $25,000-49,999 0.12528 0.0974 ... True 800.0 819.0 A0393664465886295619C51 10000 2012-11-01 00:00:00 Q4 2012 36 NaN NaN
4 0F023589499656230C5E3E2 2013-09-14 18:38:39.097000000 36CE356043264555721F06C Employed Executive True 9583.333333 $100,000+ 0.24614 0.2085 ... True 680.0 699.0 A180369302188889200689E 15000 2013-09-20 00:00:00 Q3 2013 36 NaN 1.0

5 rows × 22 columns

In [8]:
#Check for the shape
data_df.shape
Out[8]:
(113937, 22)
In [9]:
#Check for infomation about the new features
data_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 22 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   ListingKey              113937 non-null  object 
 1   ListingCreationDate     113937 non-null  object 
 2   MemberKey               113937 non-null  object 
 3   EmploymentStatus        111682 non-null  object 
 4   Occupation              110349 non-null  object 
 5   IncomeVerifiable        113937 non-null  bool   
 6   StatedMonthlyIncome     113937 non-null  float64
 7   IncomeRange             113937 non-null  object 
 8   BorrowerAPR             113912 non-null  float64
 9   BorrowerRate            113937 non-null  float64
 10  BorrowerState           108422 non-null  object 
 11  DebtToIncomeRatio       105383 non-null  float64
 12  IsBorrowerHomeowner     113937 non-null  bool   
 13  CreditScoreRangeLower   113346 non-null  float64
 14  CreditScoreRangeUpper   113346 non-null  float64
 15  LoanKey                 113937 non-null  object 
 16  LoanOriginalAmount      113937 non-null  int64  
 17  LoanOriginationDate     113937 non-null  object 
 18  LoanOriginationQuarter  113937 non-null  object 
 19  Term                    113937 non-null  int64  
 20  ClosedDate              55089 non-null   object 
 21  TotalProsperLoans       22085 non-null   float64
dtypes: bool(2), float64(7), int64(2), object(11)
memory usage: 17.6+ MB
In [10]:
#check for the descriptive statistics of numerical features
data_df.describe()
Out[10]:
StatedMonthlyIncome BorrowerAPR BorrowerRate DebtToIncomeRatio CreditScoreRangeLower CreditScoreRangeUpper LoanOriginalAmount Term TotalProsperLoans
count 1.139370e+05 113912.000000 113937.000000 105383.000000 113346.000000 113346.000000 113937.00000 113937.000000 22085.000000
mean 5.608026e+03 0.218828 0.192764 0.275947 685.567731 704.567731 8337.01385 40.830248 1.421100
std 7.478497e+03 0.080364 0.074818 0.551759 66.458275 66.458275 6245.80058 10.436212 0.764042
min 0.000000e+00 0.006530 0.000000 0.000000 0.000000 19.000000 1000.00000 12.000000 0.000000
25% 3.200333e+03 0.156290 0.134000 0.140000 660.000000 679.000000 4000.00000 36.000000 1.000000
50% 4.666667e+03 0.209760 0.184000 0.220000 680.000000 699.000000 6500.00000 36.000000 1.000000
75% 6.825000e+03 0.283810 0.250000 0.320000 720.000000 739.000000 12000.00000 36.000000 2.000000
max 1.750003e+06 0.512290 0.497500 10.010000 880.000000 899.000000 35000.00000 60.000000 8.000000
  • The Datatype of the column 'ListingCreationDate', 'LoanOriginationDate' and 'ClosedDate' was stored as an object. so it has to be changed to datetime dtype
In [11]:
#change the ListingCreationDate  and ClosedDate to datetime format
data_df[['ListingCreationDate', 'LoanOriginationDate', 'ClosedDate']] =data_df[['ListingCreationDate', 'LoanOriginationDate', 'ClosedDate']].apply(pd.to_datetime, errors = 'coerce')

Create New Column in order to help explore more out of the data¶

Feature Engineering

In [12]:
#extracting the specific Listing Creation Day and Listing Creation Month from the ListingCreationDate
data_df['ListingCreationDay'] = data_df['ListingCreationDate'].dt.day_name()
data_df['ListingCreationMonth'] = data_df['ListingCreationDate'].dt.month_name()
In [13]:
#extracting the specific Loan Origination Day and Loan Origination Month from the LoanOriginationDate
data_df['LoanOriginationDay'] = data_df['LoanOriginationDate'].dt.day_name()
data_df['LoanOriginationMonth'] = data_df['LoanOriginationDate'].dt.month_name()
data_df['LoanOriginationYear'] = pd.DatetimeIndex(data_df['LoanOriginationDate']).year
In [14]:
#extracting the specific ClosedDay and ClosedMonth from the ClosedDate
data_df['ClosedDay'] = data_df['ClosedDate'].dt.day_name()
data_df['ClosedMonth'] = data_df['ClosedDate'].dt.month_name()
data_df['ClosedYear'] = pd.DatetimeIndex(data_df['ClosedDate']).year
In [15]:
#Drop the columns that wont be neccessary for further analysis
data_df.drop(['ListingCreationDate','LoanOriginationDate','ClosedDate'], axis = 1, inplace = True)
  • New columns were created for each specific ListingCreationDay, ListingCreationMonth, ListingCreationYear as well as ClosedDay, ClosedMonth, ClosedYear. This will help us explore more out of the data
In [16]:
#Show the first 5 cols
data_df.head()
Out[16]:
ListingKey MemberKey EmploymentStatus Occupation IncomeVerifiable StatedMonthlyIncome IncomeRange BorrowerAPR BorrowerRate BorrowerState ... Term TotalProsperLoans ListingCreationDay ListingCreationMonth LoanOriginationDay LoanOriginationMonth LoanOriginationYear ClosedDay ClosedMonth ClosedYear
0 1021339766868145413AB3B 1F3E3376408759268057EDA Self-employed Other True 3083.333333 $25,000-49,999 0.16516 0.1580 CO ... 36 NaN Sunday August Wednesday September 2007 Friday August 2009.0
1 10273602499503308B223C1 1D13370546739025387B2F4 Employed Professional True 6125.000000 $50,000-74,999 0.12016 0.0920 CO ... 36 NaN Thursday February Monday March 2014 NaN NaN NaN
2 0EE9337825851032864889A 5F7033715035555618FA612 Not available Other True 2083.333333 Not displayed 0.28269 0.2750 GA ... 36 NaN Friday January Wednesday January 2007 Thursday December 2009.0
3 0EF5356002482715299901A 9ADE356069835475068C6D2 Employed Skilled Labor True 2875.000000 $25,000-49,999 0.12528 0.0974 GA ... 36 NaN Monday October Thursday November 2012 NaN NaN NaN
4 0F023589499656230C5E3E2 36CE356043264555721F06C Employed Executive True 9583.333333 $100,000+ 0.24614 0.2085 MN ... 36 1.0 Saturday September Friday September 2013 NaN NaN NaN

5 rows × 27 columns

Question: Which Employment Status does the dataset contain?

In [46]:
def Countplot(col):
    plt.figure(figsize = (14,8))
    countplot  =sns.countplot(x = col, data = data_df, color = 'cornflowerblue' )
    sns.set_style("darkgrid")
    plt.tight_layout()
    return countplot
In [47]:
#Show the plot Distribution of Employment Status by counts
Countplot('EmploymentStatus')
plt.title('Total Sum of Borrower Per EmploymentStatus', size = 15)
Out[47]:
Text(0.5, 1.0, 'Total Sum of Borrower Per EmploymentStatus')
  • The Visualization shows the different category of employment Status we have in the datasets. from this visualization, it shows that most most borrower are employment while a very few of them are retired.

Question: what is the total sum of borrower we have in each states in the datasets?

In [19]:
#Show the plot Distribution of Employment Status by counts
Countplot('BorrowerState')
plt.title('Total Sum of Borrower Per EmploymentStatus', size = 15)
Out[19]:
Text(0.5, 1.0, 'Total Sum of Borrower Per EmploymentStatus')
  • This Visualization shows different Borrower State we have in the datasets. from this visualization, the State with code CA tends to have the highest number of borrower.

Question: what is the total sum of borrower we have in the top Ten (10)occupations in the datasets?

In [20]:
#Show the plot Distribution of top 10 Occupation by counts
plt.figure(figsize = (14,8))
data_df['Occupation'].value_counts().head(10).plot.bar()
plt.legend()
plt.xlabel('Occupation')
plt.ylabel('Counts')
sns.set_style("darkgrid")
plt.tight_layout()
plt.title('Total Sum of Borrower Per Occupation', size = 15)
Out[20]:
Text(0.5, 1.0, 'Total Sum of Borrower Per Occupation')
  • The above Visualization shows the different category of occupation we have in the datasets. from this visualization, it shows that there are much borrowers who did not indicates there occupations. They are categorise as Other. However, according to indicated occupations, the professional have the highest count while the clerical have the least count.

Question: what is the distribution of borrower based on their Stated Monthly Income?

In [21]:
#Create an Histogram for Stated Monthly Income Distribution
plt.figure(figsize = [15, 6])
bin= np.arange(0, data_df['StatedMonthlyIncome'].max()+600, 600)
plt.hist(data= data_df, x = 'StatedMonthlyIncome', bins= bin)
plt.xlim(0, 20000)
plt.xlabel('Stated Monthly Income')
plt.ylabel('Counts')
plt.title('Distribution of Stated Monthly Income', size = 15)
Out[21]:
Text(0.5, 1.0, 'Distribution of Stated Monthly Income')
  • This Visualization shows that there are more borrower with stated monthly income that is less than 5000

Question: what is the distribution of borrower based on the Loan Original Amount?

In [22]:
#Create an Histogram for Loan Original Amount
plt.figure(figsize = [15, 6])
bin= np.arange(0, data_df['LoanOriginalAmount'].max()+600, 600)
plt.hist(data= data_df, x = 'LoanOriginalAmount', bins= bin)
plt.xlim(0, 20000)
plt.xlabel('Loan Original Amount')
plt.ylabel('Counts')
plt.title(' Loan Original Amount Distribution', size = 15)
Out[22]:
Text(0.5, 1.0, ' Loan Original Amount Distribution')
  • This Visualization shows the histogram of the Original Loan amount. From the visualization, It indicates there are more people that borrow about 4000 loan amount than other loan amount.

Question: What is the distribution of borrower based on their Income Range?

In [23]:
#Show barchart for Income Range Distribution
plt.figure(figsize = [15, 8])
color_pal = sns.color_palette()[0]
order_type = ['$0', '$1-24,999', '$25,000-49,999', '$50,000-74,999','$75,000-99,999', '$100,000+', 'not displayed', 'Not employed']
sns.countplot(data= data_df, x= 'IncomeRange', color = color_pal, order = order_type)
plt.xlabel("Income Range")
plt.ylabel("Counts")
plt.title("Distribution of Income Range")
Out[23]:
Text(0.5, 1.0, 'Distribution of Income Range')
  • This visualization shows that most people that borrow tends to have an income range of 25,000 - 50,000

Question: what is the distribution of borrower based on the Borrower Rate Distribution?

In [24]:
#Create an Histogram for Borrower Rate Distribution
plt.figure(figsize = [15, 6])
bin= np.arange(0, data_df['BorrowerRate'].max()+0.009,0.009)
plt.hist(data= data_df, x = 'BorrowerRate', bins= bin)
plt.xlabel('Borrower Rate')
plt.xlim(0, 0.40)
plt.ylabel('Counts')
plt.title(' Borrower Rate Distribution', size = 15)
Out[24]:
Text(0.5, 1.0, ' Borrower Rate Distribution')
  • This visualization shows that most of the borrowers have a borrower rate of 0.15

Question: which day of the week did Listing Creation occur most?

In [25]:
#show a doughnut chat Dritribution of Listing Creation Day
plt.figure(figsize = (10, 8))
sorted_counts = data_df['ListingCreationDay'].value_counts()
labels =  ['Monday', 'Tuesday', 'Wednesday', 'Thurday', 'Friday', 'Saturday', 'Sunday']

plt.pie(sorted_counts, labels = labels, startangle = 90,
        counterclock = False, wedgeprops = {'width' : 0.4}, autopct="%1.1f%%")
plt.axis('square')
plt.legend()
plt.title('Listing Creation Day Ditribution', size =15)
Out[25]:
Text(0.5, 1.0, 'Listing Creation Day Ditribution')
  • The Pie Chart shows that most loan was Listed on Tuesday while the least was listed on Sunday

Question: which day of the week did Loan Origination occur most?

In [26]:
#show a doughnut chat Dritribution of Loan Origination Day
plt.figure(figsize = (10, 8))
sorted_counts = data_df['LoanOriginationDay'].value_counts()
labels =  ['Monday', 'Tuesday', 'Wednesday', 'Thurday', 'Friday']

plt.pie(sorted_counts, labels = labels, startangle = 90,
        counterclock = False, wedgeprops = {'width' : 0.4}, autopct="%1.1f%%")
plt.axis('square')
plt.legend()
plt.title('Loan Origination Day Ditribution')
Out[26]:
Text(0.5, 1.0, 'Loan Origination Day Ditribution')
  • The Pie Chart shows that most loan was originated on Tuesday while the least was originated on Monday

Question: which day of the week is the most Closing Day?

In [27]:
#show a doughnut chat Dritribution of ClosedDay
plt.figure(figsize = (10, 8))
sorted_counts = data_df['ClosedDay'].value_counts()
labels =  ['Monday', 'Tuesday', 'Wednesday', 'Thurday', 'Friday', 'Saturday', 'Sunday']

plt.pie(sorted_counts, labels = labels, startangle = 90,
        counterclock = False, wedgeprops = {'width' : 0.4}, autopct="%1.1f%%")
plt.axis('square')
plt.legend()
plt.title('Loan Closed Day Day Ditribution')
Out[27]:
Text(0.5, 1.0, 'Loan Closed Day Day Ditribution')
  • The Pie Chart shows that most loan was Closed on Tuesday while the least was originated on Sunday

Question: which month of the year did Loan Origination occur most?

In [28]:
#plot for patients that showed up and those that do not show based on variable named Scholarship
plt.figure(figsize = (14,8))
order_type = ['January','February','March','April','May','June', 'July', 'August', 'September', 'October', 'November', 'December'] 
sns.countplot (x = "LoanOriginationMonth" , data = data_df, color = 'cornflowerblue', order = order_type)
plt.title('Loan Origination Month Distribution', size = 15)
plt.xticks(rotation = 90)
sns.set_style("darkgrid")
plt.tight_layout()
  • The Pie Chart shows that most loan was originated on 1th Month January while the least was originated on 4th Month April

Question: which year did Loan Origination occur most?

In [29]:
#Create an bar chart for Loan Origination Year Distribution
Countplot('LoanOriginationYear')
plt.title('Loan Origination Year Distribution', size = 15)
Out[29]:
Text(0.5, 1.0, 'Loan Origination Year Distribution')
  • The Pie Chart shows that most loan was originated in the year 2013 while the least was originated in the 2005

Question: What are the number Terms we have in the dataset?

In [30]:
#Create an bar chart for Term Distribution
plt.figure(figsize = [14, 8])
sorted_counts = data_df['Term'].value_counts()
plt.pie(sorted_counts, labels = sorted_counts.index, startangle = 90, counterclock = False);
plt.axis('square')
plt.legend()
plt.title('Number of Term Ditribution', size = 15)
Out[30]:
Text(0.5, 1.0, 'Number of Term Ditribution')
  • The Pie Chart shows that most loan borrower tends to have the TERM of 36 than those with the TERM of 60 and 12

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?¶

After I explored the distributions of various interested features in the dataset. It was discovered that Most borrowers are employed,very few of them are retired, they are mostly Professional and least of them are clerical. many of the borrower have an income range from ($)25,000-74,999. Their income ratio is right skewed. Exploring the distribution of different states,it was found out that the state CA have the highest number of borrowers. Most loans were Listed and originated on Tuesday while the least was Listed on Sunday and originated on monday. This year 2013 have the highest count of loan origination

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?¶

Of the features I investigated,I changed some feature's datatype such as ListCreationDate which was previously in object Dtype to the appropriate form which is datetime. I extracted some specific day, month and year from ListingCreationDay, LoanOriginationDay and ClosedDay. This will enable me to explore the dataset more.

Biviariate Exploration¶

In [31]:
data_df.head()
Out[31]:
ListingKey MemberKey EmploymentStatus Occupation IncomeVerifiable StatedMonthlyIncome IncomeRange BorrowerAPR BorrowerRate BorrowerState ... Term TotalProsperLoans ListingCreationDay ListingCreationMonth LoanOriginationDay LoanOriginationMonth LoanOriginationYear ClosedDay ClosedMonth ClosedYear
0 1021339766868145413AB3B 1F3E3376408759268057EDA Self-employed Other True 3083.333333 $25,000-49,999 0.16516 0.1580 CO ... 36 NaN Sunday August Wednesday September 2007 Friday August 2009.0
1 10273602499503308B223C1 1D13370546739025387B2F4 Employed Professional True 6125.000000 $50,000-74,999 0.12016 0.0920 CO ... 36 NaN Thursday February Monday March 2014 NaN NaN NaN
2 0EE9337825851032864889A 5F7033715035555618FA612 Not available Other True 2083.333333 Not displayed 0.28269 0.2750 GA ... 36 NaN Friday January Wednesday January 2007 Thursday December 2009.0
3 0EF5356002482715299901A 9ADE356069835475068C6D2 Employed Skilled Labor True 2875.000000 $25,000-49,999 0.12528 0.0974 GA ... 36 NaN Monday October Thursday November 2012 NaN NaN NaN
4 0F023589499656230C5E3E2 36CE356043264555721F06C Employed Executive True 9583.333333 $100,000+ 0.24614 0.2085 MN ... 36 1.0 Saturday September Friday September 2013 NaN NaN NaN

5 rows × 27 columns

In [32]:
data_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 27 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   ListingKey              113937 non-null  object 
 1   MemberKey               113937 non-null  object 
 2   EmploymentStatus        111682 non-null  object 
 3   Occupation              110349 non-null  object 
 4   IncomeVerifiable        113937 non-null  bool   
 5   StatedMonthlyIncome     113937 non-null  float64
 6   IncomeRange             113937 non-null  object 
 7   BorrowerAPR             113912 non-null  float64
 8   BorrowerRate            113937 non-null  float64
 9   BorrowerState           108422 non-null  object 
 10  DebtToIncomeRatio       105383 non-null  float64
 11  IsBorrowerHomeowner     113937 non-null  bool   
 12  CreditScoreRangeLower   113346 non-null  float64
 13  CreditScoreRangeUpper   113346 non-null  float64
 14  LoanKey                 113937 non-null  object 
 15  LoanOriginalAmount      113937 non-null  int64  
 16  LoanOriginationQuarter  113937 non-null  object 
 17  Term                    113937 non-null  int64  
 18  TotalProsperLoans       22085 non-null   float64
 19  ListingCreationDay      113937 non-null  object 
 20  ListingCreationMonth    113937 non-null  object 
 21  LoanOriginationDay      113937 non-null  object 
 22  LoanOriginationMonth    113937 non-null  object 
 23  LoanOriginationYear     113937 non-null  int64  
 24  ClosedDay               55089 non-null   object 
 25  ClosedMonth             55089 non-null   object 
 26  ClosedYear              55089 non-null   float64
dtypes: bool(2), float64(8), int64(3), object(14)
memory usage: 21.9+ MB

Questions: What is the relationship between Borrower Rate and Loan Original Amount?

In [33]:
sns.set(rc={'figure.figsize':(10,8)}, style="whitegrid")
sns.regplot(x='BorrowerRate', y='LoanOriginalAmount', data=data_df)
plt.xlabel('BorrowerRate')
plt.ylabel('LoanOriginalAmount')
plt.title('relationship between the Loan Original Amount and Borrower Rate')

data_corr = data_df.corr()

print("The Correlation Between BorrowerRate And LoanOriginalAmount is ",data_corr.loc['BorrowerRate','LoanOriginalAmount']);
The Correlation Between BorrowerRate And LoanOriginalAmount is  -0.3289599499385382
  • The Borrow rate decrease with increase in loan amount rate

Questions: What is the correlation between the numerical variables?

In [34]:
cols =['StatedMonthlyIncome','BorrowerAPR','BorrowerRate','DebtToIncomeRatio','CreditScoreRangeLower','CreditScoreRangeUpper','LoanOriginalAmount','Term']
corr_df = data_df[cols]
In [35]:
#shows the heapmap for the relationship betwen some numercal variables
plt.figure(figsize = (14,6))
heapmap  =sns.heatmap (corr_df.corr(), annot = True, cmap = 'RdYlGn', linewidths = .9)
plt.title('Relationship between two numerical variables', size = 20)
Out[35]:
Text(0.5, 1.0, 'Relationship between two numerical variables')
  • The Heatmap shows that there are more correlation between Borrower APR and Borrower Rate

Questions: What is distribution of StatedMonthlyIncome for each EmploymentStatus?

In [36]:
#show the boxplot of StatedMonthlyIncome distribution for EmploymentStatus 
import plotly.express as px
fig = px.box(data_df, x = 'EmploymentStatus', y = 'StatedMonthlyIncome',range_y=[0,25000])
fig.show()
  • The borrowers who are categorized as self employed and employed tends to have the highest stated monthly income
In [ ]:
 

Questions: What is relationship between LoanOriginalAmount and term?

In [37]:
#Show barchart for the relationship between LoanOriginalAmount and term
plt.figure(figsize = [15, 6])
color_pal = sns.color_palette()[0]
sns.barplot(data= data_df, x= 'Term', y = 'LoanOriginalAmount' , color = color_pal)
plt.xlabel("Term")
plt.ylabel("LoanOriginalAmount")
plt.title("The Sum Distribution of Loan Original Amount in respect to the Terms ", size =20)
Out[37]:
Text(0.5, 1.0, 'The Sum Distribution of Loan Original Amount in respect to the Terms ')
  • The Visualiation shows that higher the orignal amount of the loan, the higher length of the loan express in months

Question: what is the relationship between IncomeRange and CreditScoreRangeUpper?

In [38]:
#Show barchart for the relationship between IncomeRange and CreditScoreRangeUpper
plt.figure(figsize = [15, 7])
color_pal = sns.color_palette()[0]
order_type = ['$0', '$1-24,999', '$25,000-49,999', '$50,000-74,999','$75,000-99,999', '$100,000+', 'not displayed', 'Not employed' ]
sns.barplot(data= data_df, x= 'IncomeRange', y = 'CreditScoreRangeUpper' , color = color_pal, order = order_type)
plt.xlabel("IncomeRange")
plt.ylabel("CreditScoreRangeUpper")
plt.title("The Sum Distribution of Credit Score Range Upper in respect to the Income Range ", size =20)
Out[38]:
Text(0.5, 1.0, 'The Sum Distribution of Credit Score Range Upper in respect to the Income Range ')
  • The visualization shows that the higher Income range, the higher The upper value representing the range of the borrower's credit

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?¶

During the Investigation, the relationship between two features were studied. i discovered that, of all the numerical features in the datasets, the correlation between the Borrower APR and Borrower Rate was the highest which is 0.99. moreso, It was discovered that the borrower interest Rate and loan original amount are negatively correlated, which means the more the loan amount, the lower the Borrower Rate. however,The loan original amount is positively correlated to the stated monthly income. That is, the higher their stated monthly income, the higher the loan amount borrowed.

The relationship between the each Employment status and original loan amount shows that those who are employed and fulltime on average take out larger loans than other groups. in the case of original loan amount and terms,their relationship indicates that the higher the orignal amount of the loan, the higher length of the loan express in months.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?¶

Apart from the main features of interest, I also observed that:

  • There more of only 1 investors

  • The LoanStatus shows that we have more loan that are currently active even more than those that are completed. it was also observed that there is a very rare case of a loan being cancelled.

  • There are less recommedations (mostly 0)

  • Borrowers with income ranging from ($)50,000-100,000+ are majorly homeowners

Multivariate Exploration¶

In [39]:
data_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 27 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   ListingKey              113937 non-null  object 
 1   MemberKey               113937 non-null  object 
 2   EmploymentStatus        111682 non-null  object 
 3   Occupation              110349 non-null  object 
 4   IncomeVerifiable        113937 non-null  bool   
 5   StatedMonthlyIncome     113937 non-null  float64
 6   IncomeRange             113937 non-null  object 
 7   BorrowerAPR             113912 non-null  float64
 8   BorrowerRate            113937 non-null  float64
 9   BorrowerState           108422 non-null  object 
 10  DebtToIncomeRatio       105383 non-null  float64
 11  IsBorrowerHomeowner     113937 non-null  bool   
 12  CreditScoreRangeLower   113346 non-null  float64
 13  CreditScoreRangeUpper   113346 non-null  float64
 14  LoanKey                 113937 non-null  object 
 15  LoanOriginalAmount      113937 non-null  int64  
 16  LoanOriginationQuarter  113937 non-null  object 
 17  Term                    113937 non-null  int64  
 18  TotalProsperLoans       22085 non-null   float64
 19  ListingCreationDay      113937 non-null  object 
 20  ListingCreationMonth    113937 non-null  object 
 21  LoanOriginationDay      113937 non-null  object 
 22  LoanOriginationMonth    113937 non-null  object 
 23  LoanOriginationYear     113937 non-null  int64  
 24  ClosedDay               55089 non-null   object 
 25  ClosedMonth             55089 non-null   object 
 26  ClosedYear              55089 non-null   float64
dtypes: bool(2), float64(8), int64(3), object(14)
memory usage: 21.9+ MB

Questions: What is the Distribution of Loan Original Amount for the Terms per IncomeVerifiable

In [40]:
#Show barchart for the realationship between LoanOriginalAmount and term
plt.figure(figsize = [15, 6])
color_pal = sns.color_palette()[0]
sns.barplot(data= data_df, x= 'Term', y = 'LoanOriginalAmount' , hue = 'IncomeVerifiable', color = color_pal)
plt.xlabel("Term")
plt.ylabel("LoanOriginalAmount")
plt.title("The Sum Distribution of Loan Original Amount vs Term in IncomeVerifiable ")
Out[40]:
Text(0.5, 1.0, 'The Sum Distribution of Loan Original Amount vs Term in IncomeVerifiable ')
  • The visulaizations shows that those with an Income verifiable tends to borrow an higher amount of loan and have a longer term

Question: what is the influence of the Term on relationship between DebtToIncomeRatio and loan amount

In [41]:
#Show barchart for the ralationship between LoanOriginalAmount and DebtToIncomeRatio with respect to the tems
plt.figure(figsize = [15, 6])
color_pal = sns.color_palette()[0]
sns.barplot(data= data_df, x= 'LoanOriginationYear', y = 'DebtToIncomeRatio' , hue = 'Term', color = color_pal)
plt.xlabel("Term")
plt.ylabel("LoanOriginalAmount")
plt.title("The Sum Distribution of Loan Original Amount vs  Debt To Income Ratio in respect to the Terms")
Out[41]:
Text(0.5, 1.0, 'The Sum Distribution of Loan Original Amount vs  Debt To Income Ratio in respect to the Terms')
  • The visulaizations shows that the Debt To Income were higher in 2007 when we have a term of 35

Question:what is the influence of the Term on relationship between BorrowerRate and loan amount

In [42]:
# Check the influence of the Term on relationship between BorrowerRate and loan amount
group=sns.FacetGrid(data=data_df, aspect=1.2, height=4, col='Term', col_wrap=3)
group.map(sns.regplot, 'LoanOriginalAmount', 'BorrowerRate', x_jitter=0.04, scatter_kws={'alpha':0.1});
group.add_legend();

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?¶

In this part of investigation, it shows that those with an Income verifiable tends to borrow an higher amount of loan and have a longer term. Also Debt To Income were higher in 2007 when we have a term of 35

Were there any interesting or surprising interactions between features?¶

Those with verifiable Income tends to borrow an higher amount of loan and have a longer term than those with no verifiable income.

Conclusion¶

During the investigation of the Prosper Loan, I focused more on the specific features of interest. Exploratory analysis was performed on individual variables as well as the relationship between two or more variables.

From the Univariant Analysis, It was deduced that Most borrowers are employed,very few of them are retired, they are mostly Professional and least of them are clerical. many of the borrower have an income range from ($)25,000-74,999 and most of them have stated monthly income that is less than 5000 Their income ratio is right skewed. Exploring the distribution of different states,it was found out that the state CA have the highest number of borrowers. Most loans were Listed and originated on Tuesday while the least was Listed on Sunday and originated on monday. This year 2013 have the highest count of loan origination

From the Bivariant Exploration during which the relationship between two features were studied. At first, the correlations between different numerical features was discovered with heatmap, the correlation between the Borrower APR and Borrower Rate was the highest which is 0.99. moreso, It was discovered that the borrower interest Rate and loan original amount are negatively correlated, which means the more the loan amount, the lower the Borrower Rate. however,The loan original amount is positively correlated to the stated monthly income. That is, the higher their stated monthly income, the higher the loan amount borrowed.

The relationship between the each Employment status and original loan amount shows that those who are employed and fulltime on average take out larger loans than other groups. in the case of original loan amount and terms,their relationship indicates that the higher the orignal amount of the loan, the higher length of the loan express in months.

Investigating further on Multivariate Exploration where relationship between three or more features were studied and also explore how each of those feature influenced one another. It was deduced that those with an Income verifiable tends to borrow an higher amount of loan and have a longer term. Also Debt To Income were higher in 2007 when we have a term of 35 months

Reference

  • Google
  • Stackoverflow